[Issue #2590] Replace gh in analytics ETL #3393
base: main
Conversation
This test used to fail if the username contained a dot (e.g. `first.last`). This commit adjusts the regex to allow usernames with dots.
Adds a class to `analytics.integrations.github.client` that makes calls to the GitHub GraphQL API, replacing the `gh` CLI.
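At a high level, such a client wraps authenticated POST requests to GitHub's GraphQL endpoint. A minimal sketch of the idea, assuming illustrative names and the `requests` library (the actual class in `analytics.integrations.github.client` may differ):

```python
# Hypothetical sketch, not the PR's exact code.
import requests

GITHUB_GRAPHQL_ENDPOINT = "https://api.github.com/graphql"


class GitHubGraphqlClient:
    """Makes authenticated calls to the GitHub GraphQL API."""

    def __init__(self, token: str) -> None:
        self._session = requests.Session()
        self._session.headers.update({"Authorization": f"Bearer {token}"})

    def execute(self, query: str, variables: dict | None = None) -> dict:
        """POST a query and return the `data` payload, raising on errors."""
        response = self._session.post(
            GITHUB_GRAPHQL_ENDPOINT,
            json={"query": query, "variables": variables or {}},
            timeout=30,
        )
        response.raise_for_status()
        payload = response.json()
        if payload.get("errors"):
            raise RuntimeError(f"GraphQL errors: {payload['errors']}")
        return payload["data"]
```

Paginated calls would then loop on GraphQL's `pageInfo { hasNextPage endCursor }`, feeding each page's `endCursor` back in as the `after` variable.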
After the refactor, we no longer need them
@@ -18,7 +18,6 @@ RUN apt-get update \
    libpq-dev \
    postgresql \
    wget \
    jq \
Removing `jq` because we no longer need it for transformations.
# Install gh CLI
# docs: https://github.com/cli/cli/blob/trunk/docs/install_linux.md
Removing this script because we no longer need the `gh` CLI.
@@ -19,6 +19,7 @@ class DBSettings(PydanticBaseEnvConfig):
    ssl_mode: str = Field("require", alias="DB_SSL_MODE")
    db_schema: str = Field ("app", alias="DB_SCHEMA")
    slack_bot_token: str = Field(alias="ANALYTICS_SLACK_BOT_TOKEN")
    github_token: str = Field(alias="GH_TOKEN")
Added this because we now need to reference the token directly within the codebase, instead of indirectly like we did previously with the `gh` CLI.
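As a small, hypothetical illustration of that wiring (usage assumed, not taken from the PR):

```python
# Illustrative only: pydantic resolves the GH_TOKEN environment
# variable via the Field alias, so the token can be passed to the
# client explicitly instead of relying on the gh CLI's ambient auth.
settings = DBSettings()
client = GitHubGraphqlClient(token=settings.github_token)
```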
Since we are in this file, can we rename `DBSettings` to something more accurate?
###########################
# Do not add these values to this file
# to avoid mistakenly committing them.
# Set these in your shell
# by doing `export ANALYTICS_REPORTING_CHANNEL_ID=whatever`
ANALYTICS_REPORTING_CHANNEL_ID=DO_NOT_SET_HERE
ANALYTICS_SLACK_BOT_TOKEN=DO_NOT_SET_HERE
GH_TOKEN=DO_NOT_SET_HERE
Prevents tests from failing if someone hasn't set their GitHub token locally.
"ANN101", # missing type annotation for self | ||
"ANN102", # missing type annotation for cls |
Removed these because they've been dropped in the latest version of ruff.
@@ -78,7 +76,6 @@ ignore = [
    "PTH123", # `open()` should be replaced by `Path.open()`
    "RUF012", # Mutable class attributes should be annotated with `typing.ClassVar`
    "TD003", # missing an issue link on TODO
    "PT004", # pytest fixture leading underscore - is marked deprecated
Same with this one
This file is basically a complete refactor, but it preserves the existing helper functions for the export to prevent this PR from getting bigger than it already is.
Removes this because we no longer need it
@@ -40,7 +40,7 @@ def test_init(
    records = caplog.records
    assert len(records) == 2
    assert re.match(
        r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\(\w+\)$",
        r"^start test_logging: \w+ [0-9.]+ \w+, hostname \S+, pid \d+, user \d+\([\w\.]+\)",
Changed this because the tests were failing locally if there was a period in the username, e.g. billy.daly
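A quick illustration of the failure, simplified to the tail of the pattern:

```python
import re

# \w does not match ".", so the old pattern rejected dotted usernames;
# the character class [\w.] accepts them.
old = re.compile(r"user \d+\(\w+\)$")
new = re.compile(r"user \d+\([\w.]+\)")

line = "start test_logging: ... pid 42, user 501(billy.daly)"
assert old.search(line) is None      # old regex fails on "billy.daly"
assert new.search(line) is not None  # new regex matches
```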
I have (1) significant question about the data formatting; everything else looks fine.
{
    "project_owner": owner,
    "project_number": project,
    "issue_title": safe_pluck(item, "content.title"),
I don't understand why we need `safe_pluck`. If there's a bunch of fields missing, I would rather the code raise a `KeyError` instead of getting us bad (e.g. mostly null) data.
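(For context, the call sites above imply a helper roughly like the following; this sketch is an assumption about its behavior, not the PR's exact code.)

```python
from typing import Any


def safe_pluck(data: Any, path: str, default: Any = None) -> Any:
    """Walk a dot-separated path through nested dicts, returning
    `default` instead of raising KeyError when a key is missing or
    an intermediate value isn't a dict."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# e.g. safe_pluck(item, "content.title") -> None when "content" is absent
```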
I can see the concern, but it's technically valid for all of these attributes to be empty in GitHub, except `issue_title`, `issue_url`, and `issue_opened_at`. For example, this issue has `issue_type` and `issue_status` but everything else is blank (e.g. sprint, parent, points, etc.)
We're currently validating the output data using the `IssueMetadata` pydantic class when we parse these items in this step.

We could have the non-nullable fields fail with a `KeyError` at this step, but the pydantic validation gives us better debugging output and allows us to gracefully continue with exporting and transforming the rest of the issues.
If we want to be more strict with what we consider "valid" data, I could see us requiring `issue_type` and `issue_status` as well.

Although since the logs are only retained for a limited amount of time, having issues without a type or status get "silently" dropped is often less helpful than having them with `null` data in Metabase.

The broader strategy around data quality and effectively handling a "dead letter queue" of bad data is the subject of this epic on data quality checks.
I'd definitely welcome your thoughts, though, on other potential strategies here as an intermediate step to implementing more robust data quality checks!
I think the basic things we're trying to achieve in the transform step are:
- Prevent "bad" data from being inserted into the database (i.e. data that is missing required columns, or data that is missing optional columns because of a bug in the ETL -- the latter one is harder to check for)
- Support inserts of data that are missing optional columns, when they are valid
- Prevent failures of a subset of data from blocking loads of the remaining valid data
Typically I've achieved these goals by using tools like Great Expectations or Anomalo which run on the entire data set to check for quality issues or anomalies, but there might be immediate steps we can take right now to block bad data or catch more programming errors upfront.
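One near-term pattern along those lines (a sketch of the general idea, not code from this PR; field names are borrowed from the discussion above and assume Pydantic v2):

```python
from pydantic import BaseModel, ValidationError


class IssueMetadata(BaseModel):
    issue_title: str
    issue_url: str
    issue_opened_at: str
    sprint: str | None = None  # optional columns default to None


def split_valid(raw_items: list[dict]) -> tuple[list[IssueMetadata], list[dict]]:
    """Validate each record, quarantining failures so one bad record
    doesn't block the load of the remaining valid data."""
    valid, dead_letter = [], []
    for raw in raw_items:
        try:
            valid.append(IssueMetadata.model_validate(raw))
        except ValidationError as err:
            dead_letter.append({"record": raw, "errors": err.errors()})
    return valid, dead_letter
```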
Since we are already using Pydantic, we can use its nested models feature:
https://stackoverflow.com/questions/70302056/define-a-pydantic-nested-model
Then we could drop this translation and `safe_pluck` layer entirely, relying on Pydantic to do the validation and null-data transformation.
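A sketch of that approach, with illustrative field names and assuming Pydantic v2:

```python
from pydantic import BaseModel


class IssueContent(BaseModel):
    title: str
    url: str


class ProjectItem(BaseModel):
    content: IssueContent      # required nested object
    sprint: str | None = None  # optional fields default to None
    points: int | None = None


# Pydantic validates the nested structure straight from the raw
# GraphQL dict, replacing manual safe_pluck calls; a missing required
# field raises a ValidationError naming every offending path.
item = ProjectItem.model_validate(
    {"content": {"title": "Example issue", "url": "https://github.com/example"}}
)
```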
Co-authored-by: kai [they] <[email protected]>
LGTM. Nice work, especially on the tests
Summary

Replaces the sub-process call to the `gh` CLI with a `GitHubGraphqlClient` class that can make calls to the GitHub GraphQL API directly from Python. Fixes #2590

Time to review: 10 mins

Changes proposed

- Adds a `GitHubGraphqlClient` class that can make paginated calls to the GitHub GraphQL API
- Replaces the sub-process calls in `src/analytics/etl/github/main.py` with the `GitHubGraphqlClient`
- Removes the `make-graphql-call.sh` script that previously invoked the `gh` CLI

Context for reviewers

Instructions to test

- `make build`
- `make sprint-reports-with-latest-data`
Notes

We'll want to refactor the `src/analytics/integrations/github/` sub-package a little bit further, pulling most of the code in the `main.py` file in that sub-package into `src/analytics/etl/github.py` instead. I didn't include that in this PR to try to minimize the amount of code I was changing, but we can/should tackle that refactor in #3203, because some of the functions in `main.py` still write to the local file system but can easily be updated to pass the exported data as a python dictionary.

Additional information
The local run of sprint reports with the new code matches the output of the last run triggered by AWS step functions (using code in `main`) posted to Slack:

Sprint report for HHS/13
- In Slack (based on `main`): [screenshot]
- Locally, based on this feature branch: [screenshot]

Sprint burndown for HHS/17
- In Slack (based on `main`): [screenshot]
- Locally, based on this feature branch: [screenshot]

Deliverable percent complete
- In Slack (based on `main`): [screenshot]
- Locally, based on this feature branch: [screenshot]